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Abstract 

In this paper we study the problem of sorting unsigned genomes by double-cut-and-join operations, where 
genomes allow a mix of linear and circular chromosomes to be present. First, we formulate an equivalent 
optimization problem, called maximum cycle/path decomposition, which is aimed at finding a largest collection of 
edge-disjoint cycles/AA-paths/AB-paths in a breakpoint graph. Then, we show that the problem of finding a largest 
collection of edge-disjoint cycles/AA-paths/AB-paths of length no more than / can be reduced to the well-known 
degree-bounded k-set packing problem with k = 21. Finally, a polynomial-time approximation algorithm for the 
problem of sorting unsigned genomes by double-cut-and-join operations is devised, which achieves the 
approximation ratio y + € ~ 1-4444 + e, for any positive s. For the restricted variation where each genome 
contains only one linear chromosome, the approximation ratio can be further improved to f§ + £ ~ 1.4082 + e. 



Introduction 

A fundamental problem in the study of genome rearran- 
gements is to compute the genomic distance between 
two genomes based on their gene orders, where the 
genomic distance is generally defined as the minimum 
number of evolutionary operations necessary to trans- 
form one genome to another. This problem has been 
extensively studied in the last two decades [1-4]. 

The choices of genome evolutionary operations include 
reversals (also called inversions), translocations, fissions, 
fusions, transpositions and block-interchanges. To unify 
all these classical operations, Yancopoulos et al [4] intro- 
duced a single operation, called double-cut-and-join 
(DCJ). It basically cuts a genome in two places and then 
joins the resulting four ends in a new way. Noticeably, 
computing the genomic distance based on DCJ opera- 
tions can be applied between two genomes that allow a 
mix of linear and circular chromosomes to be present. 

The complexity of computing the genomic distance 
between two genomes seems to largely depend on the 
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availability of gene strand information rather than on 
the choice of evolutionary operations. For instance, the 
problem of sorting by reversals is tractable when gene 
strand information is available [5,6], but becomes 
intractable once gene strand information is not available 
[7]. The same conclusion also applies to the problem of 
sorting genomes by the double-cut-and-join operations. 
The DCJ operation was initially introduced to sort two 
signed genomes in [4,8], where a simple formula was 
derived to compute the genomic distance in linear time. 
It was recently used to sort two unsigned genomes in 
[9,10]; in this case, one instead has to tackle an NP-hard 
optimization problem. 

To tackle an NP-hard optimization problem, it is of 
highly practical interest to develop a polynomial-time 
approximation algorithm with provable performance 
guarantee. For the problem of sorting by double-cut-and- 
join operations, in the case of unsigned uni- chromosome 
genomes, Chen presented in [9] an approximation algo- 
rithm with a performance ratio of -j^ + e ~ 1.4167 + e, 

for any positive s. In the case of unsigned linear gen- 
omes, Jiang et al [10] recently devised a 1.5-approxima- 
tion algorithm. 
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In this paper we study the problem of sorting unsigned 
genomes by double-cut-and-join operations in the more 
general case where genomes allow a mix of linear and 
circular chromosomes to be present. The main goal is to 
devise a polynomial-time approximation algorithm for 
this NP-hard problem. To this end, we first formulate a 
new and equivalent combinatorial optimization problem, 
called maximum cycle/path decomposition, which is 
aimed at finding a largest collection of edge-disjoint 
cycles/AA-paths/AB-paths in the breakpoint graph con- 
structed from two input genomes. Then, we show that 
the problem of finding a largest collection of edge-dis- 
joint cycles/AA-paths/AB-paths of length no more than / 
in the breakpoint graph can be reduced to the well- 
known degree-bounded k-set packing problem, where k = 
21. Finally, we present a polynomial-time approximation 
algorithm for the problem of sorting unsigned genomes 
by DCJ operations and then obtain its approximation 
ratio + s ~ 1.4444 + s, for any positive s. To our best 
knowledge, it is the first polynomial-time approximation 
algorithm of this kind. 

Methods 

Preliminaries 

Genes, chromosomes, and genomes 

A gene is a stretch of DNA with two ends: the 3' end and 
the 5' end. Both are called the extremities of a gene. A 
chromosome is a single double-stranded DNA molecule 
that contains a sequence of genes. It can be either a linear 
molecule or a circular molecule. For a linear chromo- 
some, there is a telomere marker located at each of its 
two ends. A genome is the whole collection of chromo- 
somes in a cell. For example, the following gives a gen- 
ome containing two linear chromosomes and one 
circular chromosome. 
Genome A = {(o, a, c, d, o), (b, e), (o,f, g, o)} 
Here we use the symbol V to represent a telomere 
marker, further indicating that the corresponding chro- 
mosome is linear. An unsigned alphabetical symbol is 
used to represent a gene for which the 3' end and 5' end 
are not yet identified. Accordingly, a genome in this 
representation is called unsigned. On the other hand, if 
every gene is represented as a signed symbol where the 
sign indicates which extremity is the 3' end (and the 
other extremity must be the 5' end), then the corre- 
sponding genome is called signed. For a signed genome, 
Bergeron et al [8] introduced a new and equivalent repre- 
sentation, which is a set of adjacencies between extremi- 
ties from different genes or between telomeres and 
extremities. For example, we may represent a signed gen- 
ome 

A = { (o, -a, +c, +d, o), (+b, +e), [p, + /, +g, o) } 



as 

A = {{o r a t } r {a hr c h } r {c tr d h } r {d tr o} r {b tr e h },{e tr b h } r {o r f h } r {f tr g h } r {g tr o}} 

where a h and a t represents the 5' end and 3' ends of 
gene a, respectively. 
Sorting by double-cut-and-joins 

The double-cut-and-join (DCJ) is an operation that cuts a 
genome in two places and joins the resulting four ends in 
a new way. Specifically, the cuts are applied between two 
adjacent extremities from different genes, between a telo- 
mere and its adjacent gene extremity, or between the two 
extremities of a null chromosome if necessary. The DCJ 
operation was first introduced by Yancopoulos et al [4] 
and later refined by Bergeron et al [8] to unify all the 
classical genome rearrangement events including inver- 
sions, translocations, fissions, fusions, transpositions, 
block-interchanges, circularization and linearization. 

Given two unsigned genomes A and B on the same set 
of genes, the problem of sorting unsigned genomes by 
DCJ operations (UDCJ) is defined to find a shortest 
sequence of DCJ operations that transform one genome 
into the other. The length of such a sequence is called 
the double-cut-and-join distance between two genomes 
A and B, and denoted by d DC ]{A, B). 

Example L Let two unsigned genomes be 

A = { (o, a, c, d, o), (b, e), [p, f, g,o)}, 
B = {[a, b), (o, c, d, o), {o, e, o), (/, g) } . 

Genome A has two linear chromosomes and one circu- 
lar chromosome, while genome B has two linear chromo- 
somes and two circular chromosomes. Sorting A into B 
can, for example, be done in the following four DCJ 
operations, where the places to be cut are underlined: 

A = { [p,_a, c, d, o), (b, e), (o, f, g, o) } 

{ {o, e, b, a, c, d, o), (o, f, g, o) } 

{(a,b) f (o f e f c f d f o) f (o, f,g, o)} 

{ [a, b), (o, e,_c, d, o), (/, g), (o,_o) } 
B = {(a, b), [p, c, d, o), [p, e, 6), (/, g) } . 

We will see later that at least four DCJ operations are 
needed to transform one genome into the other. Therefore, 
the DCJ distance between A and B is d DC j(A, B) - 4. 
The degree-bounded k-set packing problem 
Given a base set S and a collection S of subsets of S, the 
set packing problem asks for the maximum number of 
pairwise disjoint subsets in S. The k-set packing problem 
is a restricted variant of the set packing problem where 
every subset in S has size at most k. If in addition the 
number of occurrences in S of any element is upper 
bounded by a constant A, then it reduces to the degree- 
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bounded k-set packing problem. The following theorem 
states the best-to-date approximability and its detailed 
proof can be found in [11]. 
Theorem 2 ([12]). The degree-bounded k-set packing 

problem can be approximated within ratio - s in 
polynomial time, for any positive s. 

This algorithm is denoted by APPROX-SP(/c, A) and 
will be used in our approximation for computing the DCJ 
distance between two unsigned genomes. Its running 
time complexity is n °( £ ~ 2 ' 6 k 2 logA) as shown in [11]. 

Basic facts 

The breakpoint graph 

The breakpoint graph (also called edge graph or compari- 
son graph) is first introduced in [1] and widely used to 
compute the genomic rearrangement distances. Let A and 
B be two unsigned genomes defined on the same set of n 
genes containing;/^ and l B linear chromosomes, respec- 
tively. We construct the breakpoint graph G{A, B) = (V, E 
= E b U E g ) as follows. Each vertex in | V\ corresponds to a 
distinct gene or a telomere, so | V\ = n + 21 A + 2l B . Every 
adjacency in A forms a black edge belonging to E b and 
every adjacency in B forms a gray edge belonging to E g . It 
is easy to see that \E b \ = n + l A , \E g \ = n + I B , and \E\ = 2n 
+ l A + l B . Moreover, every telomere of genome A (resp. 
genome B) has degree one and is incident to a black (resp. 
gray) edge, whereas every gene has degree four and is inci- 
dent to two black and two gray edges. For short, a telo- 
mere of genome A is referred to as an A-telomere, and a 
telomere of genome B as a B-telomere. Note that the 
number of A-telomeres is not necessarily equal to the 
number of B-telomeres in a breakpoint graph. 

Example 3. Consider the two unsigned genomes A and B 
given in Example 1, where l A = 2 and l B = 2. The break- 
point graph G{A, B) = (V, E = E b \J E g ) is depicted in 
Figurel, in which \V\ = 15, \E b \ = 9, \E g \ = 9, and \E\ = 18. 
The cycle/path decomposition 

A cycle/path in the breakpoint graph G(A, B) is called 
alternating if its edges are alternatively black and gray. 
From now on, whenever we mention cycles/paths in a 
breakpoint graph, they are alternating and edge-disjoint. 
A path is called an AA-path (resp. BB-path) if it con- 
nects two A-telomeres (resp. two B-telomeres) or an 
AB-path if it connects an A-telomere and a B-telomere. 
The length of a cycle/path is referred to as the number 
of black edges that it contains. 

Since every telomere vertex has degree one and every 
gene vertex has two incident black edges and two inci- 
dent gray edges, there always exists a cycle/path decom- 
position of G{A } B) into edge-disjoint cycles, AA-paths, 
AB-paths and BB-paths. A maximum cycle/path decom- 
position refers to a cycle/path decomposition that con- 
tains the maximum number of edge-disjoint cycles, AA- 
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Figure 1 A breakpoint graph between two genomes. The input 
two genomes are A = {(o, a, c, d, o), {b, e), (o, f, q, o)} and B = {{a, 
b), (o, c, d, o), (o, e, o), {f, g)l The solid dots, hollow circles and 
squares are vertices representing genes, the telomere markers of 
genome A and of genome B, respectively. The solid and dashed 
lines are used to represent black and gray edges, respectively. 



paths and AB-paths. Note that this maximum number 
does not take into account any BB-paths. The maximum 
cycle/path decomposition problem is hence defined as the 
problem of finding such a maximum cycle/path decom- 
position of a breakpoint graph. See Figure 2 for an 
example. 

Sorting signed genomes 

Before proceeding to study the problem of sorting 
unsigned genomes by DCJ operations, we take a first 
look at the case of signed genomes. Note that the above 
concept of breakpoint graph extends naturally to two 
signed genomes X an d B It can be done by replacing 




Figure 2 A maximum cycle/path decomposition of the 
breakpoint graph depicted in Figurel. It consists of two cycles, 
one AA-path, two AB-paths and one BB-path. Hence, we obtain c = 
2, l AA = 1 and l AB = 2 so that d DO = 4. 



Chen et al. BMC Bioinformatics 201 1, 12(Suppl 9):S17 
http://www.biomedcentral.com/1471-2105/12/S9/S17 



Page 4 of 8 



each signed gene by its two (unsigned) extremities in the 
relative order. Yancopoulos, et al [4] further proposed to 
close each AA-path into a cycle by adding a gray edge 
connecting two A-telomeres, close each BB-path into a 
cycle by adding a black edge connecting two B-telomeres, 
and close each AB-path into a cycle by identifying the A- 
telomere and B-telomere (with the B-telomere elimi- 
nated). As is customary, no edge is drawn between two 
extremities of the same gene in the breakpoint graph. 

Given a breakpoint graph for two signed genomes X 
and b we note that its cycle/path decomposition is 
unique and trivial because every vertex in G(A, B) has 
degree at most two. The following theorem implies that 
computing the DCJ distance between two signed gen- 
omes can be done in linear time. 

Theorem 4 ([4]). LetX andg be two genomes defined 
on the same set of n genes, then we have 

d DC/ (A,B) = b -c 

where is the number of black edges and c the number 

of cycles in the breakpoint graph G(A,B) after closing all 
the paths into cycles. 
Sorting unsigned genomes 

Theorem 7 that we will present below establishes the 
connection between the cycle/path decomposition of a 
breakpoint graph and the DCJ distance between two 
unsigned genomes. Its proof uses the following two lem- 
mas (i.e., Lemmas 5 and 6). 

Lemma 5. For every cycle/path decomposition of G{A, 
B), there exists a signed version a and of genomes A 
and B such that 

d D cM,B) = b-(c + I AA +I AB ). 

where b is the number of black edges in the breakpoint 
graph G(A, B) and c (resp. I AA and I AB ) is the number of 
cycles {resp. AA-paths and AB-paths) in the given cycle/ 
path decomposition of G{A, B). 

Proof. Note that every gene vertex would be visited 
twice if we traverse all the cycles/paths in a fixed cycle/ 
path decomposition of G(A, B). When a gene vertex 
(e.g., gene a) is visited for the first time, we may assume 
that we are visiting the 5' end of the gene (denoted as 
a h ). When it is visited for the second time, we may 
assume that we are visiting the 3' end of the gene 
(denoted as a t ). To obtain a signed genome X (repre- 
sented as a set of adjacencies), we form an adjacency for 
every two extremities (or one extremity and one A-telo- 
mere) that are connected by a black edge in the given 
cycle/path decomposition. Similarly, to obtain a signed 
genome B, we form an adjacency for every two extre- 
mities (or one extremity and one B-telomere) that are 



connected by a gray edge in the given cycle/path 
decomposition. It is easy to see that the resulting gen- 
omes X an d B are the signed version of genomes A 
and B, respectively. 

Moreover, the breakpoint graph G(A, B), before clos- 
ing its paths into cycles, preserves all the cycles/paths 
from the given cycle/path decomposition of G(A, B)— 
that is, there are still b black edges, I AA AA-paths, I AB 
AB-paths and I BB BB-paths. After closing paths into 
cycles, the breakpoint graph G(A, B) would have 

b = b + I BB black edges (as we close each BB-path into 
a cycle by adding one black edge) and 

c = c + 1^ + I AB + I BB cycles. It hence follows from 
Theorem 4 that 

d D cM, B) = b-c=b + I BB -{c + I AA +I AB + l BB ) = b-{c + I AA + I AB ). 

Lemma 6. For every signed version X and B of genomes 
A and B, there exists a cycle/path decomposition of G(A, 
B) such that 

d D cM,B) = b-{c + I AA +I AB ). 

where b is the number of black edges in the breakpoint 
graph G(A, B) and c {resp. I AA and I AB ) is the number of 
cycles {resp. AA-paths and AB-paths) in this cycle/path 
decomposition of G{A, B). 

Proof. Observe that we would obtain the breakpoint 
graph G{A, B) if we combine two extremity vertices of a 
same gene into a single vertex in the breakpoint graph 
G(A, B) (before closing paths into cycles). 

Therefore, the trivial cycle/path decomposition of 
G(A, B) naturally gives rise to a cycle/path decomposition 
of G(A, B) which preserves the same numbers of black 
edges/cycles/AA-paths/AB-paths/BB-paths (denoted as b, 
c > Iaa> Iab an d I BB , respectively). After closing paths into 
cycles, the breakpoint graph G(A,B) would have 
b = b + I BB black edges and c=c + l AA +l AB + 1 BB 
cycles, as justified in the preceding lemma. By Theorem 4, 
we then have d DCJ (A, B) = b -c = b-{c + I AA + l AB ). 

Theorem 7. Let A and B be two unsigned genomes 
defined on the same set of genes. Then, we have 

d DCJ {A> B ) = b ~ ( c + j aa+ Iab) 

where b is the number of black edges in G(A, B) and c 
{resp., I AA and I AB ) is the number of cycles {resp., AA- 
paths and AB-paths) in a maximum cycle/path decompo- 
sition of G{A, B). 

Proof. Let us consider a maximum cycle/path decom- 
position of G{A, B). By Lemma 5, we would obtain a 
signed version X an d B °f genomes A and B such that 
d DCJ {A, B) = b-[c + Ij^ + l AB ). It means that genome 
X can be transformed into genome g with a sequence 
of d DCJ [A, B) DCJ operations. Observe that this same 
sequence of DCJ operations can be also used to 
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transform genome A into genome B. It hence follows that 
d D cA A > B ) ^ d DcM, B) = b-{c + I AA + I AB ). 

Assume now that a sequence of d DC j(A, B) DCJ opera- 
tions can be applied to transform genome A into gen- 
ome B. Let X be a signed version of genome A in 
which all genes are positive. We then apply the same 
sequence of d DC j{A> B) DCJ operations to the signed 
genome A. The resulting genome g would be genome 
B if we disregard all the gene signs; in other words, g 
shall be a signed version of genome B. Thus, 
d DCJ (A, B) > d DC j(A,B). On the other hand, by Lemma 
6, there exists a cycle/path decomposition of G(A, B) 
such that d DCJ (A,B)>b-(c + I AA +I AB ). Thus, d DC j 
(A, B) > b - (c + I AA + I AB ). 



Results 

The approximation algorithm 

Note that, for a given pair of genomes, the number of 
black edges is fixed without regard to any cycle/path 
decomposition. Theorem 7 hence suggests a way to 
approximate the DCJ distance between two unsigned 
genomes via maximizing the number of cycles/AA- 
paths/AB-paths in a cycle/decomposition of the break- 
point graph. To do so, our proposed approximation 
algorithm performs three subroutines to find edge-dis- 
joint cycles/paths of length one, of length two, and of 
length no more than three, respectively. 
Finding edge-disjoint cycles/paths of length one 
Lemma 8. There exists a maximum cycle/path decompo- 
sition of G(A, B) which contains a largest collection of 
edge-disjoint cycles/ AA-paths/AB-paths of length one. 

Proof Let C be a maximum cycle/path decomposition of 
G(A, B) and Q a largest collection of edge-disjoint cycles/ 
AA-paths/AB-paths of length one in G(A, B). If C 3 is a 
cycle/AA-path/AB-path contained in C\ but not in C 
(please refer to Figure 3), then the maximum cycle/path 
decomposition C shall contain the cycles/paths Q and C 2 
(note that they are not necessarily distinct). In this case, 
we would modify the maximum cycle/path decomposition 



C as follows: take one gene vertex of the only black edge 
of C 3 and re-connect its incident black edges and gray 
edges in a new way. Consequently, Q and C 2 would be 
replaced by the two distinct cycles/paths C 3 and C 4 in C. 
Suppose by contradiction that the above modification 
decreases the number of cycles/AA-paths/AB-paths so 
that the new C is no longer a maximum cycle/path 
decomposition. Note first that this would happen only 
when Ci and C 2 are two distinct cycles/AA-paths/AB- 
paths but C 4 is a BB-path. Since no AA-path in G(A, B) 
could be of length one, C 3 is either a cycle or an AB-path. 
Hence, C 3 and C 4 together use at least two B-telomeres 
and at most one A-telomere, and so do C x and C 2 
together. Since neither C x nor C 2 is a BB-path, they 
together shall use A-telomeres no less than B-telomeres, 
contrasting the previously established fact that they 
together use at least two B-telomeres and at most one A- 
telomere. Continue this process with the remaining 
cycles/AA-paths/AB-paths that are contained in C x but 
not in C. It would necessarily end up with a maximum 
cycle/path decomposition that contains a largest collection 
of edge-disjoint cycles/AA-paths/AB-paths of length one. 

It is worth noticing that there might be no maximum 
cycle/path decomposition of G(A, B) which could con- 
tain all the cycles/AA-paths/AB-paths of length one. 

Lemma 9. The problem of finding a largest collection 
of edge-disjoint cycles, AA-paths and AB-paths of length 
one in the breakpoint graph G(A, B) is solvable in poly- 
nomial time. 

Proof. We can transform a problem instance of finding 
a largest collection of edge-disjoint cycles/AA-paths/AB- 
paths of length one into an instance of the 2-set packing 
problem, where the base set S contains all the edges of 
G(A, B) and each subset of the collection S is comprised 
of edges of a cycle/AA-path/AB-path of length one in G 
(A, B). The 2-set packing problem can be reduced to 
the maximal matching problem which is well-known to 
be solvable in polynomial time by a simple greedy algo- 
rithm [13]. 



(I) 



(II) 




o 



c 2 
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Figure 3 Two possible cases of the cycles/paths C 3 of length one. (I) a cycle, and (II) an AB-path. 
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Following Lemma 9, we assume from now on that 
there does not exist any cycle/AA-path/AB-path of 
length one in the breakpoint graph G(A, B). 
Finding edge-disjoint cycles/paths of length two or three 

The following lemma gives an upper bound on the num- 
ber of distinct cycles/paths of length less than or equal to 
/ that traverse a common edge. Although this upper 
bound is no way the tight one, it is already enough for 
our purpose to devise an approximation algorithm. 

Lemma 10. Every edge in the breakpoint graph G(A, B) 
belongs to at most (2 2/+1 x I 2 ) distinct cycles/ AA-paths/ 
AB-paths of length less than or equal to I. 

Proof. Let us consider a breadth-first traversal of edges 
in the breakpoint graph G(A, B) that starts at a given 
edge e, and then count all the cycles/AA-paths/AB-paths 
of length less than or equal to / that use the edge e. Note 
that every vertex is incident to at most two black edges 
and at most two gray edges and also that every edge is at 
distance at most (21 - 1) from another edge of the same 
cyle/path of length less than or equal to /. The traversal 
of every such cycle/path will end at two edges at distance 
i and j from the root edge e, respectively, such that i ,j > 
0 and / + j < 21 - 1. If we fix values for i and there are 
at most 2 l x 2 ; = 2* +; < 2 2/_1 cycles/paths whose traversals 
end at two edges in distance i and ; from the root edge e, 
respectively. For all the combinations of values i and ; 
such that i < 21 - 1 and / < 21 - 1, there are at most 21 x 
21 x 2 2/_1 = 2 2/+1 x l 2 cycles/paths to be reached; in other 
words, there are at most 2 2/+1 x l 2 cycles/paths of length 
less than or equal to / that use the edge e. 

To find a largest collection of edge-disjoint cycles/ 
paths of length no more than /, we may construct a col- 
lection S of subsets of the base set S, where S is com- 
prised of all the edges in G(A, B) and each subset of S is 
comprised of edges of a distinct cycle/AA-path/AB-path 
of length no more than / in G(A, B). Then, we obtain an 
instance (S, S) of the /c-set packing problem where k - 
21 Further by the above lemma, we obtain the following 
observations. 

Corollary 11. Finding a largest collection of edge-disjoint 
cycles/AA-paths/AB-paths of length two can he trans- 
formed into an instance of the degree-bounded 4-set pack- 
ing problem. 

Corollary 12. Finding a largest collection of edge-disjoint 
cycles/AA-paths/AB-paths of length no more than three 
can he transformed into an instance of the degree-bounded 
6-set packing problem. 

It is worth noting that, if, as previously done in [10] 
on the breakpoint graph, all the paths are closed into 
cycles to make a maximum cycle decomposition instead 
of a maximum cycle/path decomposition, we would not 
know whether the above corollaries (and hence our 
approximation algorithm proposed later) are still valid. 
Two lemmas below further follow from Theorem 2. 



Lemma 13. The problem of finding a largest collection 
of edge-disjoint cycles/AA-paths/AB-paths of length two 
in the breakpoint graph G(A, B) can be approximated 
with ratio |- - s in polynomial time, for any positive s. 

Lemma 14. The problem of finding a largest collection 
of edge- disjoint cycles/AA-paths/AB-paths of length no 
more than three in the breakpoint graph G(A, B) can be 
approximated with ratio y - s in polynomial time, for 
any positive s. 
Algorithm details 

Let A and B be two unsigned genomes defined on the 
same set of genes. Given a breakpoint graph G{AJ$), our 
proposed algorithm for the cycle/path decomposition is 
summarized below: 

1. Find a largest collection Q of cycles/AA-paths/AB- 
paths of length one by a greedy algorithm; 

2. Remove from the breakpoint graph all the edges 
used by the cycles/AA-paths/AB-paths of Cy 

3. Find a collection C 2 of cycles/AA-paths/AB-paths of 
length two by Algorithm APPROX-SP(/c, A); 

4. Find a collection C 3 of cycles/AA-paths/AB-paths of 
length no more than three by Algorithm APPROX-SP(/c, 
A); 

5. Decompose the remaining edges arbitrarily into a 
collection C 4 of cycles/AA-paths/AB-paths/BB-paths; 

6. Output either C 1 U C 2 U C 4 or C 1 U C 3 U C 4 , 
depending on which one has the larger size. 

For a maximum cycle/path decomposition, let r 2 
denote the number of cycles/AA-paths/AB-paths of 
length two, r 3 the number of cycles/AA-paths/AB-paths 
of length three, and r' the total number of cycles/AA- 
paths/AB-paths. Let r be the total number of cycles/AA- 
paths/AB-paths in the cycle/path decomposition 
returned by our proposed algorithm. The cycles/AA- 
paths/AB-paths of length one are all excluded from the 
above counts since it would not affect the worst-case 
algorithmic performance. We find the worst-case 
approximation ratio of our algorithm by solving the fol- 
lowing optimization problem: 

b-r' 

max b-r 
subject to r 2 + r 3 < r, 



(f -e)r 2 <r', 

(l-s){r 2 +r 3 )<r f , 

b>\, 
r 2 ,r 3 >0. 

The second constraint is due to the fact that every 
cycle/AA-path/AB-path of length four or larger uses at 
least four black edges. The third and fourth constraints 
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follow from Lemmas 13 and 14, respectively. By solving 
the above fractional linear programming problem 
(please refer to Lemma 2.3 of [14]), we would obtain the 
maximum objective function value being + 8, which 
indeed gives the performance ratio of our proposed 
algorithm. 

Theorem 15. The problem of sorting unsigned genomes 
by DC] operations can be approximated within 
ratio + s ~ 1.4444 + 8 in polynomial time, for any 
positive s. 

Sorting unsigned permutations 

A genome can be represented as an unsigned permuta- 
tion when it contains only one linear chromosome. For 
this restricted case, we can improve the approximation 
ratio further by applying the following result from [14] 
in place of Lemma 13. 

Lemma 16 ([14]). Let A and B be two unsigned uni- 
chromosome genomes. The problem of finding a largest 
collection of edge- disjoint cycles/ AA-paths/AB-paths of 
length two in the breakpoint graph G(A, B) can be 
approximated with ratio j - 8 in polynomial time, for 
any positive s. 

In view of this lemma, the third constraint in the 
above fractional linear programming problem can be 
replaced by the inequality (y - s)r 2 < r , which hence 
leads to the following theorem. 

Theorem 17. The problem of sorting unsigned permu- 
tations by DC] operations can be approximated within 
ratio H + 8 ~ 1.4082 + 8 in polynomial time, for any 
positive £. 

Conclusions 

Since the introduction of the NP-hard problem of sort- 
ing unsigned genomes by double-cut-and-join opera- 
tions in [9], the polynomial-time approximation 
algorithms have been developed only under two 
restricted genome models. The first one is intended for 
sorting uni-chromosome genomes and its best-to-date 
performance ratio is + £ ~ 1.4167 + 8, for any posi- 
tive s [9]. The second one is intended for sorting linear 
genomes and its best-to-date performance ratio is 1.5 
[10]. In this paper, we have presented an approximation 
algorithm for the problem of sorting unsigned genomes 
by double-cut-and-join operations in the general case 
where genomes allow a mix of linear and circular chro- 
mosomes to be present. The performance ratio thus 
achieved is + 8 ~ 1.4444 + 8, for any positive s. In 
addition, for the first restricted genome model men- 
tioned above, an improved performance ratio of 
|| + 8 ~ 1.4082 + 8 is also achieved. However, the pro- 
posed algorithm is mainly of theoretical interest rather 
than the practical use, due to its huge factor polynomial 
running time n o{s ~ 26 ) 



Conceptually, our proposed algorithm operates in the 
same spirit as many previous algorithms for approximat- 
ing the genomic distance via genome rearrangement 
operations [1,10,14,15]. However, when we began this 
work, it was not clear whether the problem of finding a 
largest collection of edge-disjoint cycles/AA-paths/AB- 
paths of length two or three can be reduced to a degree- 
bounded k-set packing problem (rather than a general k- 
set packing problem). In this paper we established this 
reduction, which then leads to the improved approxima- 
tion ratios. 
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